Agenda

•What is explicit Multi GPU 
•API Introduction 
•Engine Requirements 
•Frame Pipelining - Case Study 


Problem With Implicit Multi GPU 


Ideal situation 

• Driver does its magic 

• Developer doesn't have to care

• It just works 


Reality 

• Driver needs lots of hints 

• Clears, discards 

• Vendor specific APIs 

• Developer needs to understand what the driver is trying to do

• It still doesn't always fly


What is Explicit Multi-GPU?

• Control cross GPU transfers 

• No unintended implicit transfers 

• Control what work is done on each GPU 

• Not just Alternate Frame Rendering (AFR) 

DX12 Explicit Multi GPU

•No more driver magic 

•There is no driver level support for AFR 

•Now you can do it better yourself, and 
much more! 

•No vendor specific APIs needed 

Adapters - Linked Node Adapter

[Diagram: one ID3D12Device* containing Node 0, Node 1 and Node 2]

Adapters - Multiple Adapters

[Diagram: three separate adapters, each with its own ID3D12Device*, sharing data through a Cross Adapter Resource Heap (ID3D12Heap*)]

Linked Node Adapter

• When the user has enabled use of multiple GPUs in the display driver, linked node mode is enabled

• IDXGIFactory1::EnumAdapters1() sees one adapter
• ID3D12Device::GetNodeCount() tells the node count

• Nodes (GPUs) are referenced with affinity masks

• Node 0 = 0x1 (0000 0001)
• Node 1 = 0x2 (0000 0010)
• Node 0 and 1 = 0x3 (0000 0011)
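
The mask for a node is a single bit shifted by the node index, and masks for several nodes are OR-ed together. A minimal sketch in plain C++ follows; the helper names NodeMask, AllNodesMask and IsSingleNode are made up for illustration and are not part of D3D12.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, not D3D12 API. Node indices are zero-based.

// Mask with only the bit for one node set: node 0 -> 0x1, node 1 -> 0x2.
inline std::uint32_t NodeMask(std::uint32_t nodeIndex) {
    assert(nodeIndex < 32);
    return 1u << nodeIndex;
}

// Mask covering every node of a linked adapter, e.g. 2 nodes -> 0x3.
// nodeCount would come from ID3D12Device::GetNodeCount().
inline std::uint32_t AllNodesMask(std::uint32_t nodeCount) {
    assert(nodeCount >= 1 && nodeCount <= 32);
    return nodeCount == 32 ? 0xFFFFFFFFu : (1u << nodeCount) - 1u;
}

// True when exactly one bit is set, as needed wherever a single node is expected.
inline bool IsSingleNode(std::uint32_t mask) {
    return mask != 0 && (mask & (mask - 1)) == 0;
}
```

For example, NodeMask(0) | NodeMask(1) gives the 0x3 mask that addresses both nodes of a two-GPU system.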

Linked Node Features

• Resource copies directly from discrete GPU to discrete GPU - not through system memory

• Special support for AFR
  IDXGISwapChain3::ResizeBuffers1() allows utilization of connections other than PCIe when presenting frames

Good for multiple discrete GPUs!

[Diagram: GPU 0 and GPU 1 connected by both PCI Express and a multi GPU link]

Linked Node Load Balancing

• It's safe to assume that nodes are balanced for the foreseeable future
• Life is easy

• Heterogeneous nodes may be available some day 

Infrastructure For Explicit M-GPU

•Renderer has to be aware of multiple GPUs 

•Expose multiple GPUs at right level 

•Wrap command queues, resources, descriptors, 
gpu virtual addresses etc. for multiple GPUs 

•This can actually be the part that requires 
most effort 

•Once infrastructure exists, it's easier to 
experiment 



Multi Node APIs

•With linked nodes, some things are very easy 
•Some interfaces are omni node (no node mask) 

•Starting with ID3D12Device 

•Some interfaces are multi node 

•Affinity mask can have more than one bit set 

• Root signatures, pipeline states and command signatures can often just be shared between all nodes

[Diagram: ID3D12RootSignature*, ID3D12PipelineState* and ID3D12CommandSignature*, each with a NodeMask]

Command Queues And Lists

•Each node has its own 
ID3D12CommandQueue, i.e. "engine" 

•ID3D12CommandLists are also exclusive 
to single node 

•Command list pooling for each node is needed 


[Diagram: ID3D12CommandQueue* with NodeMask 0x1 and type D3D12_COMMAND_LIST_TYPE_DIRECT]



GAME DEVELOPERS CONFERENCE March 14-18, 2016 • Expo: March 16-18, 2016 #GDC16



Command List Pooling

[Diagram: an ID3D12CommandList* and two ID3D12CommandQueue* objects, all of type D3D12_COMMAND_LIST_TYPE_DIRECT]

Synchronization - Fences

•Different command queues need to be 
synchronized when sharing resources 

•ID3D12Fence is the synchronization tool 



Fences 

•Application must avoid access conflicts 

•Application must ensure that all engines see 
shared resources in same state 


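[Diagram: the first ID3D12CommandQueue* does Write, Signal, Do something; the second ID3D12CommandQueue* does Wait, Read. Signal and Wait go through an ID3D12Fence*; Write and Read target a shared ID3D12Resource*]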
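
The write, signal, wait, read ordering can be tried out without a GPU. The sketch below is a CPU-side model only: Op, Cmd, State, Run and DemoHandoff are hypothetical names, and the round-robin scheduler is an assumption chosen to make the stall visible, not how a driver schedules queues.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side model of two queues and one fence. None of these
// types are D3D12 types; they only mimic the ordering rules.
enum class Op { Write, Signal, Wait, Read, Other };

struct Cmd {
    Op op;
    std::uint64_t value;  // payload for Write, fence value for Signal/Wait
};

struct State {
    std::uint64_t fence = 0;     // models the fence's completed value
    std::uint64_t resource = 0;  // models the shared resource contents
    std::uint64_t lastRead = 0;  // what the reading queue observed
};

// Executes the queues round-robin, one command per queue per pass.
// A queue whose next command is Wait stalls until the fence reaches the value.
// Returns false if every unfinished queue is stalled (deadlock).
bool Run(const std::vector<std::vector<Cmd>>& queues, State& s) {
    std::vector<std::size_t> next(queues.size(), 0);
    for (;;) {
        bool pending = false, progressed = false;
        for (std::size_t q = 0; q < queues.size(); ++q) {
            if (next[q] == queues[q].size()) continue;  // queue finished
            pending = true;
            const Cmd& c = queues[q][next[q]];
            if (c.op == Op::Wait && s.fence < c.value) continue;  // stalled
            if (c.op == Op::Write) s.resource = c.value;
            if (c.op == Op::Signal && c.value > s.fence) s.fence = c.value;
            if (c.op == Op::Read) s.lastRead = s.resource;
            ++next[q];
            progressed = true;
        }
        if (!pending) return true;
        if (!progressed) return false;
    }
}

// The pattern from the diagram. The reader is listed first on purpose, to show
// that the Wait holds it back until the writer has signaled.
std::uint64_t DemoHandoff(std::uint64_t payload) {
    const std::vector<Cmd> reader = {{Op::Wait, 1}, {Op::Read, 0}};
    const std::vector<Cmd> writer = {{Op::Write, payload}, {Op::Signal, 1}, {Op::Other, 0}};
    State s;
    const bool ok = Run({reader, writer}, s);
    assert(ok);
    (void)ok;
    return s.lastRead;
}
```

Without the Wait, the reader in this model would read the resource before the writer had touched it, which is the access conflict the application has to prevent.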

Copy Engine(s)

• ID3D12CommandQueue with D3D12_COMMAND_LIST_TYPE_COPY

• Cross GPU copies parallel to other processing

• Remember to double buffer the resources

GPU 1  Graphics:  Frame 0 | Frame 1 | Frame 2 | Frame 3 | Frame 4
GPU 1  Copy:      Idle | F0 | Idle | F1 | Idle | F2 | Idle | F3 | Idle | F4
GPU 0  Graphics:  (F-2) | (F-1) | F0 | F1 | F2 | F3 | F4

Cross Node Sharing Tiers

• ID3D12Device has tiers for cross node sharing

• Tier 1 supports only cross node copy operations

• ID3D12GraphicsCommandList::CopyResource() etc.

• Tier 2 supports cross node SRV/CBV/UAV access

• While SRV/CBV/UAV access may seem convenient, check whether using parallel copy engines would be more efficient


Resources 

•Resources and descriptors need most 
attention 

•Resources/heaps have two separate node 
masks 

• CreationNodeMask is single node mask 
•VisibleNodeMask is multi node mask 

•Descriptor heap is exclusive to single node 
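
The two mask rules can be checked before a heap or resource is created. The following is a minimal sketch with hypothetical names (HeapNodeMasks, MasksAreValid); it also assumes that the creation node has to be part of the visible set, and it does not model the zero mask that single-node devices accept.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two node mask fields of a heap description.
struct HeapNodeMasks {
    std::uint32_t creationNodeMask;  // node whose memory holds the heap
    std::uint32_t visibleNodeMask;   // nodes allowed to access the heap
};

inline bool HasExactlyOneBit(std::uint32_t m) {
    return m != 0 && (m & (m - 1)) == 0;
}

// Checks that the creation mask names exactly one existing node, that the
// visible mask names only existing nodes, and (assumption of this sketch)
// that the creation node is included in the visible set.
inline bool MasksAreValid(const HeapNodeMasks& m, std::uint32_t nodeCount) {
    assert(nodeCount >= 1 && nodeCount < 32);
    const std::uint32_t all = (1u << nodeCount) - 1u;
    return HasExactlyOneBit(m.creationNodeMask)
        && (m.creationNodeMask & all) != 0
        && (m.visibleNodeMask & ~all) == 0
        && (m.visibleNodeMask & m.creationNodeMask) != 0;
}
```

On a two-node system, a heap with CreationNodeMask 0x1 and VisibleNodeMask 0x3 passes, while a CreationNodeMask of 0x3 is rejected because it names two nodes.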

Resources - Visibility

Node 0x1 memory
  ID3D12DescriptorHeap*  NodeMask 0x1
  ID3D12Heap*            CreationNodeMask 0x1, VisibleNodeMask 0x1

Node 0x2 memory
  ID3D12DescriptorHeap*  NodeMask 0x2
  ID3D12Heap*            CreationNodeMask 0x2, VisibleNodeMask 0x2




Resources - Visibility

Node 0x1 memory
  ID3D12DescriptorHeap*  NodeMask 0x1
  ID3D12Heap*            CreationNodeMask 0x1, VisibleNodeMask 0x1
  ID3D12Heap*            CreationNodeMask 0x1, VisibleNodeMask

Node 0x2 memory
  ID3D12DescriptorHeap*  NodeMask 0x2
  ID3D12Heap*            CreationNodeMask 0x2, VisibleNodeMask 0x2

Resources - Assets

•Upload art assets (vertex data, textures 
etc.) to nodes that need them 

•It's often convenient to upload your assets to 
all nodes for easy experimentation 

•AFR needs assets on all nodes 

•Create a unique resource for each node, 
not just one that would be visible to others 
(with proper VisibleNodeMask ) 




Resources - AFR Targets 

•AFR requires all render targets be 
duplicated for each node 

• Need robust cycling mechanism 

•Again, a unique resource for each node, 
not one resource visible to all nodes 
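
One way to get a cycling mechanism is a small wrapper that holds one copy of an object per node and picks the copy from the frame index. This is a sketch only: PerNode is a hypothetical class, and which node takes the even frames is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wrapper: one copy of T per node. In a real renderer T would be
// a resource, descriptor, or command queue wrapper created on that node.
template <typename T>
class PerNode {
public:
    explicit PerNode(std::size_t nodeCount) : items_(nodeCount) {
        assert(nodeCount > 0);
    }

    std::size_t NodeCount() const { return items_.size(); }

    // Plain AFR cycling: frame N is rendered on node N % nodeCount.
    std::size_t NodeForFrame(std::uint64_t frameIndex) const {
        return static_cast<std::size_t>(frameIndex % items_.size());
    }

    T& ForNode(std::size_t node) {
        assert(node < items_.size());
        return items_[node];
    }

    T& ForFrame(std::uint64_t frameIndex) {
        return items_[NodeForFrame(frameIndex)];
    }

private:
    std::vector<T> items_;
};
```

With two nodes, ForFrame(0) and ForFrame(2) return the copy on node 0, and ForFrame(1) and ForFrame(3) return the copy on node 1, so each GPU only touches its own render targets.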

AFR Isn't For Everyone...

•Temporal techniques make AFR difficult 

•Too many inter-frame dependencies can kill the 
performance 

• Explicit or implicit 

AFR Workflow Problem

Ideal

GPU 1:   Frame 0 | Frame 2 | Frame 4 | Frame 6 | Frame 8
GPU 0:   Frame 1 | Frame 3 | Frame 5 | Frame 7 | Frame 9
Screen:  (F-2) | (F-1) | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8

Dependencies between frames

GPU 1  Graphics:  Frame 0 | Idle | Frame 2 | Idle | Frame 4 | Idle | Frame 6
GPU 1  Copy:      F0->F1 | Idle | F2->F3 | Idle | F4->F5 | Idle | F6->F7
GPU 0  Graphics:  Frame 1 | Idle | Frame 3 | Idle | Frame 5
GPU 0  Copy:      Idle | F1->F2 | Idle | F3->F4 | Idle | F5->F6 | Idle
Screen:           (F-1) | F0 | F1 | F2 | F3 | F4 | F5



New Possibility - Frame Pipelining 

•Pipeline rendering of frames 

•Begin frame on one GPU 

•Transfer work to next GPU to finish rendering 
and present 

•The GPUs and copy engines form a pipeline 


GPU 1  Graphics:  Frame 0 | Frame 1 | Frame 2 | Frame 3 | Frame 4 | Frame 5
GPU 1  Copy:      Idle | F0 | Idle | F1 | Idle | F2 | Idle | F3 | Idle | F4
GPU 0  Graphics:  (F-1) | F0 | F1 | F2

Pipelining - Simple Dependencies

•No back and forth dependencies between 
GPUs 

• Helps to minimize waits 

• Easier to do large cross GPU data transfers 
without reducing frame rate 

•Unless copying takes longer than actual work, 
it affects only latency, not frame rate 
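
The latency versus frame rate point can be put into numbers with a simple steady-state model. The sketch assumes that the stages (for example GPU 1 graphics, copy engine, GPU 0 graphics) run fully in parallel; the function names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical steady-state model of a pipeline whose stages run in parallel.

// A new frame completes every max(stage time): the slowest stage sets the rate.
inline double FrameIntervalMs(const std::vector<double>& stageMs) {
    assert(!stageMs.empty());
    return *std::max_element(stageMs.begin(), stageMs.end());
}

// One frame passes through every stage in turn, so its latency is the sum.
inline double FrameLatencyMs(const std::vector<double>& stageMs) {
    return std::accumulate(stageMs.begin(), stageMs.end(), 0.0);
}
```

With stages of 10 ms, 4 ms and 10 ms the interval stays at 10 ms and the copy only adds 4 ms of latency. If the copy grows to 14 ms it becomes the slowest stage and the interval rises to 14 ms.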


Pipelining - Temporal techniques 


•Temporal techniques allowed without penalties 

•Limitation: GPUs at beginning of pipeline cannot 
use resources produced further down the pipeline 

Pipelining - Something More

• Instead of doing the same faster, do something more

• GI

• Ray tracing

• Physics

• Etc.

Pipelining - Workload Distribution

• Needs a good point to split the frame

• Cross GPU copies are slow regardless of parallel copy engines

• <8 GB/s on x8 PCIe 3.0, so 64 MB takes at least 8 ms

• Doing some passes on both GPUs instead of transferring the results can be an option
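
The 8 ms figure follows from dividing the data size by the bandwidth. A minimal sketch, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and no per-copy overhead; both function names are hypothetical.

```cpp
#include <cassert>

// Time in milliseconds to move `bytes` at `bytesPerSecond`, ignoring any
// per-copy overhead (an assumption of this sketch).
inline double CopyTimeMs(double bytes, double bytesPerSecond) {
    assert(bytesPerSecond > 0.0);
    return bytes * 1000.0 / bytesPerSecond;
}

// Upper bound on frame rate when one copy of this length happens per frame
// and no other stage is slower.
inline double MaxFpsForCopy(double copyMs) {
    assert(copyMs > 0.0);
    return 1000.0 / copyMs;
}
```

CopyTimeMs(64e6, 8e9) gives 8 ms. Since the real bandwidth is below 8 GB/s, that is a lower bound for the copy, and 125 fps from MaxFpsForCopy(8.0) is an upper bound for a frame that moves 64 MB.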



Frame Pipelining Workflow

Ideal

GPU 1  Graphics:  Frame 0 | Frame 1 | Frame 2 | Frame 3 | Frame 4 | Frame 5
GPU 1  Copy:      Idle | F0 | Idle | F1 | Idle | F2 | Idle | F3 | Idle | F4
GPU 0  Graphics:  (F-2) | (F-1) | F0 | F1 | F2 | F3 | F4

Unbalanced work

[Timeline with GPU 1 Graphics, GPU 1 Copy, GPU 0 Graphics and Screen rows; the block layout is not recoverable]



Pipelining - Possible Problems 

• Workload balance between GPUs also depends on scene content

•It's never perfect, but can be reasonable 

•Latency can be a problem like in AFR 

•Scaling for 3 or 4 GPUs requires separate 
solutions 

Frame Pipelining Case Study

• Microsoft DX12 MiniEngine

• Pre-depth

• SSAO

• Sun shadow map

• Primary pass

• Particles

• Motion blur

• Bloom

• FXAA

Frame Pipelining Case Study

• As a stress test, 3840x2160 screen and 4k by 4k sun shadow map resolutions were used

• Generated on first GPU:

Resource        Format     Size     Copy time
Predepth        D32_FLOAT  31.6 MB  5.3 ms
Linear Depth    R16_FLOAT  15.8 MB  2.6 ms
SSAO            R8_UNORM   7.9 MB   1.3 ms
Sun Shadow Map  D16_UNORM  32 MB    5.3 ms
Total                      87.3 MB  14.6 ms
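
The sizes in the table can be reproduced from resolution and bytes per pixel. The sketch assumes tightly packed targets with no alignment padding, takes "4k by 4k" to mean 4096x4096, and treats the slide's MB as 2^20 bytes, which is what makes the numbers match. All names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Tightly packed size of a 2D target, ignoring alignment and padding that a
// real allocation may add (an assumption of this sketch).
inline std::uint64_t TargetBytes(std::uint64_t width, std::uint64_t height,
                                 std::uint64_t bytesPerPixel) {
    assert(bytesPerPixel > 0);
    return width * height * bytesPerPixel;
}

// Converts bytes to units of 2^20 bytes.
inline double ToMiB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

const std::uint64_t kPredepth    = TargetBytes(3840, 2160, 4);  // D32_FLOAT
const std::uint64_t kLinearDepth = TargetBytes(3840, 2160, 2);  // R16_FLOAT
const std::uint64_t kSsao        = TargetBytes(3840, 2160, 1);  // R8_UNORM
const std::uint64_t kSunShadow   = TargetBytes(4096, 4096, 2);  // D16_UNORM
const std::uint64_t kTotal = kPredepth + kLinearDepth + kSsao + kSunShadow;
```

Dividing the total by the 14.6 ms from the table gives roughly 6 MB per millisecond, which is consistent with the "<8 GB/s" estimate for cross GPU copies.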



Frame Pipelining Case Study - Performance

[Bar chart: FPS for Single GPU, Two GPUs, and Two GPUs using Copy Engine. Only one value label, 22, is recoverable]

Pipelining Case Study - GPUView

[GPUView capture: original single GPU workflow]

Pipelining Case Study - GPUView

[GPUView capture: two GPUs pipelined with copy engine. Adapter: NVIDIA GeForce GTX 980 Ti, showing graphics and copy hardware queues]


Frame Pipelining Case Study 

•1.7x framerate from single to dual GPU 

•Pretty even workload distribution, but it's 
content dependent 

•Cost of copying step would limit frame 
rate to about 60 fps on 8xPCIe 3.0 system 

Pipelining - Hiding Copy Latency

• Break up copy work into smaller chunks

• Overlap with other work for the same frame

• More and smaller command lists

• Remember the guidelines from "Practical DirectX 12"

• In the case study, the ~15 ms extra latency from copies can be almost entirely hidden

Hiding Copy Latency - GPUView

[GPUView capture of one frame. Adapter: NVIDIA GeForce GTX 980 Ti]

Summary

• No more driver magic 

• You're in control of AFR 

• Try pipelining with temporal techniques! 

• Remember copy engines! 

• You can do anything you want with that 
extra GPU - Surprise us! 

Questions?

• jsjoholm@nvidia.com 


